Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data
Dramatic improvements in high throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function than proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross validation study we show that the integrated model increases recall by 18% at 50% precision, compared to using PPI data alone. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement over PPI alone depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function.
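The neighbor assumption behind a functional linkage graph can be illustrated with a simple neighbor-voting score. This is only a minimal sketch, not the probabilistic model described in the abstract: the function name `neighbor_vote`, the protein identifiers, and the term labels are all hypothetical, and the score is a plain fraction of annotated neighbors rather than a calibrated probability.

```python
from collections import Counter


def neighbor_vote(graph, annotations, protein):
    """Score each term by the fraction of annotated neighbors that carry it.

    graph: dict mapping a protein to the set of its graph neighbors.
    annotations: dict mapping a protein to its set of known terms.
    Both structures are hypothetical stand-ins for real PPI/GO data.
    """
    # Only neighbors with at least one known annotation can vote.
    neighbors = [n for n in graph.get(protein, ()) if n in annotations]
    if not neighbors:
        return {}
    counts = Counter(term for n in neighbors for term in annotations[n])
    return {term: c / len(neighbors) for term, c in counts.items()}


# Toy, made-up example: P4 has no annotation and therefore does not vote.
graph = {"P1": {"P2", "P3", "P4"}}
annotations = {"P2": {"GO:A"}, "P3": {"GO:A", "GO:B"}}
scores = neighbor_vote(graph, annotations, "P1")
```

A term shared by every annotated neighbor receives the highest score, which is the intuition that the integrated model then combines with the other data sources.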
Integration of relational and hierarchical network information for protein function prediction
Background: In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database, such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top to bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is through the use of transitive closure to predictions. Results: We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing. Conclusion: A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is consistent uniformly with GO-term depth. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction, that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' network-based methods.
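The transitive-closure post-processing that the abstract contrasts against can be sketched as follows. This is a hedged illustration of the post-processing baseline, not the authors' proposed classifier: the `parents` mapping, the term names, and the function names are hypothetical, and a real GO hierarchy is a much larger directed acyclic graph.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as term -> set of parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def true_path_closure(predicted, parents):
    """Add all ancestors of each predicted term so the set obeys 'true-path' consistency."""
    closed = set(predicted)
    for term in predicted:
        closed |= ancestors(term, parents)
    return closed


# Toy, made-up hierarchy: leaf -> mid -> root, and other -> root.
parents = {"leaf": {"mid"}, "mid": {"root"}, "other": {"root"}}
closed = true_path_closure({"leaf"}, parents)
```

Predicting only `leaf` yields a closed set that also contains `mid` and `root`, so no term is predicted without its ancestors.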
Type 1 diabetes risk genes mediate pancreatic beta cell survival in response to proinflammatory cytokines
Publisher Copyright: © 2022. We combined functional genomics and human genetics to investigate processes that affect type 1 diabetes (T1D) risk by mediating beta cell survival in response to proinflammatory cytokines. We mapped 38,931 cytokine-responsive candidate cis-regulatory elements (cCREs) in beta cells using ATAC-seq and snATAC-seq and linked them to target genes using co-accessibility and HiChIP. Using a genome-wide CRISPR screen in EndoC-βH1 cells, we identified 867 genes affecting cytokine-induced survival, and genes promoting survival and up-regulated in cytokines were enriched at T1D risk loci. Using SNP-SELEX, we identified 2,229 variants in cytokine-responsive cCREs altering transcription factor (TF) binding, and variants altering binding of TFs regulating stress, inflammation, and apoptosis were enriched for T1D risk. At the 16p13 locus, a fine-mapped T1D variant altering TF binding in a cytokine-induced cCRE interacted with SOCS1, which promoted survival in cytokine exposure. Our findings reveal processes and genes acting in beta cells during inflammation that modulate T1D risk. Peer reviewed.
iPSCORE: A Resource of 222 iPSC Lines Enabling Functional Characterization of Genetic Variation across a Variety of Cell Types.
Large-scale collections of induced pluripotent stem cells (iPSCs) could serve as powerful model systems for examining how genetic variation affects biology and disease. Here we describe the iPSCORE resource: a collection of systematically derived and characterized iPSC lines from 222 ethnically diverse individuals that allows for both familial and association-based genetic studies. iPSCORE lines are pluripotent with high genomic integrity (no or low numbers of somatic copy-number variants) as determined using high-throughput RNA-sequencing and genotyping arrays, respectively. Using iPSCs from a family of individuals, we show that iPSC-derived cardiomyocytes demonstrate gene expression patterns that cluster by genetic background, and can be used to examine variants associated with physiological and disease phenotypes. The iPSCORE collection contains representative individuals for risk and non-risk alleles for 95% of SNPs associated with human phenotypes through genome-wide association studies. Our study demonstrates the utility of iPSCORE for examining how genetic variants influence molecular and physiological traits in iPSCs and derived cell lines.
Current Performance and On-Going Improvements of the 8.2 m Subaru Telescope
An overview of the current status of the 8.2 m Subaru Telescope constructed
and operated at Mauna Kea, Hawaii, by the National Astronomical Observatory of
Japan is presented. The basic design concept and the verified performance of
the telescope system are described. Also given are the status of the instrument
package offered to the astronomical community, the status of operation, and
some of the future plans. The status of the telescope reported in a number of
SPIE papers as of the summer of 2002 is incorporated, with some updates
included as of February 2004. However, readers are encouraged to check the most
up-to-date status of the telescope through the home page,
http://subarutelescope.org/index.html, and/or by contacting the
observatory staff directly.
Comment: 18 pages (17 pages in published version), 29 figures (GIF format),
This is the version before the galley proof
Clustering of Lyman Break Galaxies at z=4 and 5 in The Subaru Deep Field: Luminosity Dependence of The Correlation Function Slope
We explored the clustering properties of Lyman Break Galaxies (LBGs) at z=4
and 5 with an angular two-point correlation function on the basis of the very
deep and wide Subaru Deep Field data. We found an apparent dependence of the
correlation function slope on UV luminosity for LBGs at both z=4 and 5. More
luminous LBGs have a steeper correlation function. To compare these
observational results, we constructed numerical mock LBG catalogs based on a
semianalytic model of hierarchical clustering combined with high-resolution
N-body simulation, carefully mimicking the observational selection effects. The
luminosity functions for LBGs predicted by this mock catalog were found to be
almost consistent with the observation. Moreover, the overall correlation
functions of LBGs were reproduced reasonably well. The observed dependence of
the clustering on UV luminosity was not reproduced by the model, unless
subsamples of distinct halo mass were considered. That is, LBGs belonging to
more massive dark halos had steeper and larger-amplitude correlation
functions. With this model, we found that LBG multiplicity in massive dark
halos amplifies the clustering strength at small scales, which steepens the
slope of the correlation function. The hierarchical clustering model could
therefore be reconciled with the observed luminosity-dependence of the angular
correlation function, if there is a tight correlation between UV luminosity and
halo mass. Our finding that the slope of the correlation function depends on
luminosity could be an indication that massive dark halos hosted multiple
bright LBGs (abridged).
Comment: 16 pages, 17 figures, Accepted for publication in ApJ, Full
resolution version is available at
http://zone.mtk.nao.ac.jp/~kashik/sdf/acf/sdf_lbgacf.pd
A crowdsourced set of curated structural variants for the human genome.
Funder: U.S. Food and Drug Administration; funder-id: http://dx.doi.org/10.13039/100000038
A high quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, developing a reliable benchmark for large indels and structural variants (SVs) is more challenging. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. 'Expert' curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of 'expert' curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
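One plausible way to compute a concordance figure between two curators is the fraction of jointly reviewed events on which their labels agree. The abstract does not define its concordance measure, so this is an assumption for illustration only: the function name `concordance`, the event identifiers, and the labels are hypothetical.

```python
def concordance(labels_a, labels_b):
    """Fraction of events labeled by both curators on which the labels match.

    labels_a, labels_b: dicts mapping an event id to that curator's label.
    Returns None when the curators share no events (assumed convention).
    """
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        return None
    agree = sum(labels_a[event] == labels_b[event] for event in shared)
    return agree / len(shared)


# Toy, made-up labels: three shared events, two of which agree.
curator_a = {"sv1": "DEL", "sv2": "INS", "sv3": "DEL", "sv4": "INS"}
curator_b = {"sv1": "DEL", "sv2": "INS", "sv3": "INS", "sv5": "DEL"}
score = concordance(curator_a, curator_b)
```

Events reviewed by only one of the two curators are ignored, so the score reflects agreement rather than coverage.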